XORing Elephants: Novel Erasure Codes for Big Data
ABSTRACT

Distributed storage systems for large clusters typically use 
replication to provide reliability. Recently, erasure codes 
have been used to reduce the large storage overhead of three- 
replicated systems. Reed-Solomon codes are the standard 
design choice and their high repair cost is often considered 
an unavoidable price to pay for high storage efficiency and 
high reliability. 

This paper shows how to overcome this limitation. We 
present a novel family of erasure codes that are efficiently re- 
pairable and offer higher reliability compared to Reed-Solomon 
codes. We show analytically that our codes are optimal on 
a recently identified tradeoff between locality and minimum 
distance. 

We implement our new codes in Hadoop HDFS and com- 
pare to a currently deployed HDFS module that uses Reed- 
Solomon codes. Our modified HDFS implementation shows 
a reduction of approximately 2x in the repair disk I/O and
repair network traffic. The disadvantage of the new coding 
scheme is that it requires 14% more storage compared to 
Reed-Solomon codes, an overhead shown to be information 
theoretically optimal to obtain locality. Because the new 
codes repair failures faster, this provides higher reliability, 
which is orders of magnitude higher compared to replica- 
tion. 

1. INTRODUCTION 

MapReduce architectures are becoming increasingly 
popular for big data management due to their high scal- 
ability properties. At Facebook, large analytics clusters 
store petabytes of information and handle multiple ana- 
lytics jobs using Hadoop MapReduce. Standard imple- 
mentations rely on a distributed file system that pro- 
vides reliability by exploiting triple block replication. 
The major disadvantage of replication is the very large 



storage overhead of 200%, which reflects on the cluster 
costs. This overhead is becoming a major bottleneck 
as the amount of managed data grows faster than data 
center infrastructure. 

For this reason, Facebook and many others are tran- 
sitioning to erasure coding techniques (typically, classi- 
cal Reed-Solomon codes) to introduce redundancy while 
saving storage [3] [19], especially for data that is more
archival in nature. In this paper we show that classical 
codes are highly suboptimal for distributed MapReduce 
architectures. We introduce new erasure codes that ad- 
dress the main challenges of distributed data reliability 
and information theoretic bounds that show the opti- 
mality of our construction. We rely on measurements 
from a large Facebook production cluster (more than 
3000 nodes, 30 PB of logical data storage) that uses 
Hadoop MapReduce for data analytics. Facebook re- 
cently started deploying an open source HDFS Module 
called HDFS RAID [2] that relies on Reed-Solomon
(RS) codes. In HDFS RAID, the replication factor of 
"cold" (i.e., rarely accessed) files is lowered to 1 and a 
new parity file is created, consisting of parity blocks. 

Using the parameters of Facebook clusters, the data 
blocks of each large file are grouped in stripes of 10 
and for each such set, 4 parity blocks are created. This 
system (called RS (10,4)) can tolerate any 4 block fail- 
ures and has a storage overhead of only 40%. RS codes 
are therefore significantly more robust and storage ef- 
ficient compared to replication. In fact, this storage 
overhead is the minimal possible, for this level of re- 
liability [7]. Codes that achieve this optimal storage- 
reliability tradeoff are called Maximum Distance Sepa- 
rable (MDS) [31] and Reed-Solomon codes [27] form the 
most widely used MDS family. 

Classical erasure codes are suboptimal for distributed
environments because of the so-called Repair problem:
When a single node fails, typically one block is lost 
from each stripe that is stored in that node. RS codes 
are usually repaired with the simple method that re- 
quires transferring 10 blocks and recreating the original 
10 data blocks even if a single block is lost [28], hence 
creating a 10x overhead in repair bandwidth and disk
I/O. 
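The figures quoted in the last few paragraphs follow from simple counting. The sketch below reproduces them; the function names are illustrative and are not part of HDFS RAID.

```python
# Storage overhead and naive single-block repair cost for the schemes above.
# Function names are hypothetical helpers for this illustration only.

def storage_overhead(data_blocks: int, redundant_blocks: int) -> float:
    """Extra storage as a fraction of the original data."""
    return redundant_blocks / data_blocks

def naive_repair_reads(data_blocks: int) -> int:
    """Blocks read to rebuild one lost block with the simple RS method."""
    return data_blocks

replication = storage_overhead(1, 2)   # 3 copies = 1 original + 2 extra
rs_10_4 = storage_overhead(10, 4)      # 10 data blocks + 4 parity blocks

print(f"3-replication overhead: {replication:.0%}")
print(f"RS (10,4) overhead:     {rs_10_4:.0%}")
print(f"RS (10,4) reads per repaired block: {naive_repair_reads(10)}")
```

A replicated block, by contrast, is repaired by reading a single surviving copy, which is where the 10x gap in repair traffic comes from.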

Recently, information theoretic results established that 
it is possible to repair erasure codes with much less 
network bandwidth compared to this naive method [6].
There has been a significant amount of very recent work
on designing such efficiently repairable codes; see
Section 6 for an overview of this literature.

Our Contributions: We introduce a new family of 
erasure codes called Locally Repairable Codes (LRCs), 
that are efficiently repairable both in terms of network 
bandwidth and disk I/O. We analytically show that our 
codes are information theoretically optimal in terms of 
their locality, i.e., the number of other blocks needed 
to repair single block failures. We present both ran- 
domized and explicit LRC constructions starting from 
generalized Reed-Solomon parities. 

We also design and implement HDFS-Xorbas, a mod- 
ule that replaces Reed-Solomon codes with LRCs in 
HDFS-RAID. We evaluate HDFS-Xorbas using experi- 
ments on Amazon EC2 and a cluster in Facebook. Note 
that while LRCs are defined for any stripe and parity 
size, our experimental evaluation is based on an RS (10,4) code
and its extension to a (10,6,5) LRC to compare with the 
current production cluster. 

Our experiments show that Xorbas enables approxi- 
mately a 2x reduction in disk I/O and repair network 
traffic compared to the Reed-Solomon code currently 
used in production. The disadvantage of the new code 
is that it requires 14% more storage compared to RS, an 
overhead shown to be information theoretically optimal 
for the obtained locality. 

One interesting side benefit is that because Xorbas 
repairs failures faster, this provides higher availability, 
due to more efficient degraded reading performance. 
Under a simple Markov model evaluation, Xorbas has 
2 more zeros in Mean Time to Data Loss (MTTDL) 
compared to RS (10,4) and 5 more zeros compared to 
3-replication. 

1.1 Importance of Repair 

At Facebook, large analytics clusters store petabytes 
of information and handle multiple MapReduce analyt- 
ics jobs. In a 3000 node production cluster storing ap- 
proximately 230 million blocks (each of size 256MB), 
only 8% of the data is currently RS encoded ('RAIDed'). 
Fig. 1 shows a recent trace of node failures in this production
cluster. It is quite typical to have 20 or more
node failures per day that trigger repair jobs, even when 




Figure 1: Number of failed nodes over a single 
month period in a 3000 node production cluster 
of Facebook. 



most repairs are delayed to avoid transient failures. A 
typical data node will be storing approximately 15 TB 
and the repair traffic with the current configuration is 
estimated around 10-20% of the total average of 2
PB/day cluster network traffic. As discussed, (10,4) RS
encoded blocks require approximately 10x more network
repair overhead per bit compared to replicated
blocks. We estimate that if 50% of the cluster was RS 
encoded, the repair network traffic would completely 
saturate the cluster network links. Our goal is to de- 
sign more efficient coding schemes that would allow a 
large fraction of the data to be coded without facing this 
repair bottleneck. This would save petabytes of storage 
overheads and significantly reduce cluster costs. 
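A rough consistency check of this estimate, under the simplifying assumption (not stated in the text) that repair traffic grows linearly with the RS-encoded fraction of the data while everything else stays fixed:

```latex
\[
\frac{50\%}{8\%} \times (10\text{--}20\%)
  \;=\; 6.25 \times (10\text{--}20\%)
  \;\approx\; 62.5\text{--}125\%
  \quad \text{of the current average cluster network traffic.}
\]
```

Under that assumption, repair alone would be comparable to the entire present-day average traffic, which is in line with the saturation estimate.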

There are four additional reasons why efficiently re- 
pairable codes are becoming increasingly important in 
coded storage systems. The first is the issue of degraded 
reads. Transient errors with no permanent data loss 
correspond to 90% of data center failure events [9, 19].



During the period of a transient failure event, block 
reads of a coded stripe will be degraded if the corre- 
sponding data blocks are unavailable. In this case, the 
missing data block can be reconstructed by a repair pro- 
cess, which is not aimed at fault tolerance but at higher 
data availability. The only difference with standard re- 
pair is that the reconstructed block does not have to be 
written to disk. For this reason, efficient and fast repair
can significantly improve data availability. 

The second is the problem of efficient node decom- 
missioning. Hadoop offers the decommission feature to 
retire a faulty data node. Functional data has to be 
copied out of the node before decommission, a process 
that is complicated and time consuming. Fast repairs
allow node decommissioning to be treated as a scheduled
repair, starting a MapReduce job to recreate the blocks
without creating very large network traffic. 

The third reason is that repair influences the perfor- 
mance of other concurrent MapReduce jobs. Several 
researchers have observed that the main bottleneck in
MapReduce is the network [5]. As mentioned, repair
network traffic is currently consuming a non-negligible 
fraction of the cluster network bandwidth. This issue is 
becoming more significant as the storage used is increas- 
ing disproportionately fast compared to network band- 
width in data centers. This increasing storage density 
trend emphasizes the importance of local repairs when 
coding is used. 

Finally, local repair would be key to facilitating geographically
distributed file systems across data centers.
Geo-diversity has been identified as one of the key future
directions for improving latency and reliability [13].
Traditionally, sites used to distribute data across data 
centers via replication. This, however, dramatically 
increases the total storage cost. Reed-Solomon codes 
across geographic locations at this scale would be com- 
pletely impractical due to the high bandwidth require- 
ments across wide area networks. Our work makes local 
repairs possible at a marginally higher storage overhead 
cost. 

Replication is obviously the winner in optimizing the 
four issues discussed, but requires a very large storage 
overhead. On the opposing tradeoff point, MDS codes 
have minimal storage overhead for a given reliability 
requirement, but suffer in repair and hence in all these 
implied issues. One way to view the contribution of this 
paper is a new intermediate point on this tradeoff, that 
sacrifices some storage efficiency to gain in these other 
metrics. 

The remainder of this paper is organized as follows: 
We initially present our theoretical results, the con- 
struction of Locally Repairable Codes and the infor- 
mation theoretic optimality results. We defer the more 
technical proofs to the Appendix. Section 3 presents
the HDFS-Xorbas architecture and Section 4 discusses
a Markov-based reliability analysis. Section 5 discusses
our experimental evaluation on Amazon EC2 and Facebook's
cluster. We finally survey related work in Section 6
and conclude in Section 7.

2. THEORETICAL CONTRIBUTIONS 

Maximum distance separable (MDS) codes are often
used in various applications in communications and
storage systems. A (k, n-k)-MDS code$^1$ of rate
$R = k/n$ takes a file of size M, splits it in k equally sized
blocks, and then encodes it in n coded blocks each of size
$M/k$. Here we assume that our file has size exactly equal
to k data blocks to simplify the presentation; larger files
are separated into stripes of k data blocks and each
stripe is coded separately.



1 In classical coding theory literature, codes are denoted by 
(n, k) where n is the number of data plus parity blocks, 
classically called blocklength. A (10,4) Reed-Solomon code 
would be classically denoted by RS (n=14,k=10). RS codes 
form the most well-known family of MDS codes. 



A (k, n-k)-MDS code has the property that any k
out of the n coded blocks can be used to reconstruct the 
entire file. It is easy to prove that this is the best fault 
tolerance possible for this level of redundancy: any set 
of k blocks has an aggregate size of M and therefore no 
smaller set of blocks could possibly recover the file. 

Fault tolerance is captured by the metric of minimum 
distance. 

Definition 1 (Minimum Code Distance). The min- 
imum distance d of a code of length n, is equal to the 
minimum number of erasures of coded blocks after which 
the file cannot be retrieved. 

MDS codes, as their name suggests, have the largest 
possible distance which is $d_{MDS} = n - k + 1$. For example
the minimum distance of a (10,4) RS is n - k + 1 = 5
which means that five or more block erasures are needed 
to yield a data loss. 
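The "any k out of n" property can be checked by brute force on a toy code. The sketch below uses a Reed-Solomon-style Vandermonde generator over the small prime field GF(7); the field, the parameters (k, n) = (3, 6), and the helper names are illustrative assumptions and differ from the binary extension field code used in HDFS RAID.

```python
from itertools import combinations

P = 7  # small prime field GF(7); an illustrative choice

def det_mod_p(matrix, p=P):
    """Determinant of a square matrix over GF(p) by Gaussian elimination."""
    m = [row[:] for row in matrix]
    size = len(m)
    det = 1
    for col in range(size):
        pivot = next((r for r in range(col, size) if m[r][col] % p), None)
        if pivot is None:
            return 0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det
        det = det * m[col][col] % p
        inv = pow(m[col][col], -1, p)  # modular inverse (Python 3.8+)
        for r in range(col + 1, size):
            factor = m[r][col] * inv % p
            for c in range(col, size):
                m[r][c] = (m[r][c] - factor * m[col][c]) % p
    return det % p

def is_mds(generator, k, n):
    """True if every k of the n columns of the k x n generator are independent."""
    for cols in combinations(range(n), k):
        sub = [[generator[row][c] for c in cols] for row in range(k)]
        if det_mod_p(sub) == 0:
            return False
    return True

k, n = 3, 6
points = [1, 2, 3, 4, 5, 6]  # distinct evaluation points in GF(7)
vandermonde = [[pow(a, i, P) for a in points] for i in range(k)]
print("any k blocks suffice:", is_mds(vandermonde, k, n))
print("minimum distance n - k + 1 =", n - k + 1)
```

Repeating an evaluation point makes two coded blocks identical, and the same check then reports that the code is no longer MDS.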

The second metric we will be interested in is Block 
Locality. 

Definition 2 (Block Locality). A (k, n-k)
code has block locality r, when each coded block is a
function of at most r other coded blocks of the code.

Codes with block locality r have the property that, 
upon any single block erasure, fast repair of the lost 
coded block can be performed by computing a function 
on r existing blocks of the code. This concept was recently
and independently introduced in [10, 22, 24].



When we require small locality, each single coded 
block should be repairable by using only a small subset 
of existing coded blocks r << k, even when n,k grow. 
The following fact shows that locality and good distance 
are in conflict: 

Lemma 1. MDS codes with parameters (k, n-k) cannot
have locality smaller than k.

Lemma 1 implies that MDS codes have the worst possible
locality since any k blocks suffice to reconstruct the
entire file, not just a single block. This is exactly the 
cost of optimal fault tolerance. 

The natural question is what is the best locality pos- 
sible if we settled for "almost MDS" code distance. We 
answer this question and construct the first family of 
near-MDS codes with non-trivial locality. We provide a 
randomized and explicit family of codes that have log- 
arithmic locality on all coded blocks and distance that 
is asymptotically equal to that of an MDS code. We 
call such codes (k, n — k,r) Locally Repairable Codes 
(LRCs) and present their construction in the following 
section. 

Theorem 1. There exist (k, n-k, r) Locally Repairable
codes with logarithmic block locality $r = \log(k)$ and distance
$d_{LRC} = n - (1 + \delta_k)k + 1$. Hence, any subset
of $k(1 + \delta_k)$ coded blocks can be used to reconstruct the
file, where $\delta_k = \frac{1}{\log(k)} - \frac{1}{k}$.

Observe that if we fix the code rate $R = k/n$ of an LRC
and let k grow, then its distance $d_{LRC}$ is almost that of
a (k, n-k)-MDS code; hence the following corollary.

Corollary 1. For fixed code rate $R = k/n$, the distance
of LRCs is asymptotically equal to that of (k, n-k)-MDS codes:

$$\lim_{k \to \infty} \frac{d_{LRC}}{d_{MDS}} = 1.$$
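The convergence in Corollary 1 can be illustrated numerically. The sketch below assumes that the theorem's $\delta_k$ reads $1/\log(k) - 1/k$ and that the logarithm is natural; both are assumptions of this illustration, as is the fixed rate R = 1/2.

```python
import math

def d_mds(n, k):
    """Minimum distance of a (k, n-k)-MDS code."""
    return n - k + 1

def d_lrc(n, k):
    """Distance from Theorem 1, with the ASSUMED reading
    delta_k = 1/log(k) - 1/k and a natural logarithm."""
    delta_k = 1 / math.log(k) - 1 / k
    return n - (1 + delta_k) * k + 1

def distance_ratio(k, inverse_rate=2):
    """d_LRC / d_MDS at fixed rate R = 1/inverse_rate (illustrative)."""
    n = inverse_rate * k
    return d_lrc(n, k) / d_mds(n, k)

for k in (10, 100, 10_000, 1_000_000):
    print(k, round(distance_ratio(k), 4))
```

The ratio stays below 1 and increases with k; because the gap shrinks like 1/log(k), the convergence is slow.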

LRCs are constructed on top of MDS codes (and the 
most common choice will be a Reed-Solomon code). 

The MDS encoded blocks are grouped in logarith- 
mic sized sets and then are combined together to ob- 
tain parity blocks of logarithmic degree. We prove that 
LRCs have the optimal distance for that specific local- 
ity, due to an information theoretic tradeoff that we 
establish. Our locality-distance tradeoff is universal in 
the sense that it covers linear or nonlinear codes and is 
a generalization of a recent result of Gopalan et al. [10]
which established a similar bound for linear codes. Our
proof technique is based on building an information flow
graph gadget, similar to the work of Dimakis et al. [6, 7].
Our analysis can be found in the Appendix.

2.1 LRC implemented in Xorbas 

We now describe the explicit (10,6,5) LRC code we
implemented in HDFS-Xorbas. For each stripe, we start
with 10 data blocks $X_1, X_2, \ldots, X_{10}$ and use a (10,4)
Reed-Solomon code over a binary extension field $\mathbb{F}_{2^m}$ to
construct 4 parity blocks $P_1, P_2, \ldots, P_4$. This is the code
currently used in production clusters in Facebook that
can tolerate any 4 block failures due to the RS parities.
The basic idea of LRCs is very simple: we make
repair efficient by adding additional local parities. This
is shown in Figure 2.



[Figure 2 legend: 5 file blocks, 4 RS parity blocks, local parity block, implied parity block]



Figure 2: Locally repairable code implemented
in HDFS-Xorbas. The four parity blocks
$P_1, P_2, P_3, P_4$ are constructed with a standard RS
code and the local parities provide efficient repair
in the case of single block failures. The main
theoretical challenge is to choose the coefficients
$c_i$ to maximize the fault tolerance of the code.

By adding the local parity $S_1 = c_1 X_1 + c_2 X_2 + c_3 X_3 + c_4 X_4 + c_5 X_5$,
a single block failure can be repaired by accessing
only 5 other blocks. For example, if block $X_3$ is lost
(or degraded read while unavailable) it can be reconstructed by

$$X_3 = c_3^{-1}(S_1 - c_1 X_1 - c_2 X_2 - c_4 X_4 - c_5 X_5). \qquad (1)$$

The multiplicative inverse of the field element $c_3$ exists
as long as $c_3 \neq 0$, which is the requirement we will
enforce for all the local parity coefficients. It turns out
that the coefficients $c_i$ can be selected to guarantee that
all the linear equations will be linearly independent. In 
the Appendix we present a randomized and a deter- 
ministic algorithm to construct such coefficients. We 
emphasize that the complexity of the deterministic al- 
gorithm is exponential in the code parameters (n, k) and 
therefore useful only for small code constructions. 
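For the special case where every coefficient equals 1 (the choice this section later reports as sufficient for the RS code in HDFS RAID), addition and subtraction in a binary extension field are both bytewise XOR, so the local repair of Eq. (1) reduces to XORing five blocks. A minimal sketch, with a toy block size and hypothetical helper names:

```python
import os

def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

BLOCK_SIZE = 16  # toy size; the production block size is 256 MB
x = [os.urandom(BLOCK_SIZE) for _ in range(5)]  # X1..X5
s1 = xor_blocks(x)                              # local parity S1 with c_i = 1

# Lose X3 (index 2) and rebuild it from S1 and the four surviving blocks.
survivors = [blk for i, blk in enumerate(x) if i != 2]
x3_rebuilt = xor_blocks([s1] + survivors)
assert x3_rebuilt == x[2]
print("X3 recovered from", 1 + len(survivors), "blocks")
```

Only the five blocks of the local group are read; none of the other data blocks or RS parities of the stripe is touched.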

The disadvantage of adding these local parities is the 
extra storage requirement. While the original RS code 
was storing 14 blocks for every 10, the three local par- 
ities increase the storage overhead to 17/10. There is 
one additional optimization that we can perform: We 
show that the coefficients $c_1, c_2, \ldots, c_{10}$ can be chosen so
that the local parities satisfy an additional alignment
equation $S_1 + S_2 + S_3 = 0$. We can therefore not store
the local parity $S_3$ and instead consider it an implied
parity. Note that to obtain this in the figure, we set
$c'_5 = c'_6 = 1$.

When a single block failure happens in an RS parity,
the implied parity can be reconstructed and used to
repair that failure. For example, if $P_2$ is lost, it can
be recovered by reading 5 blocks $P_1, P_3, P_4, S_1, S_2$ and
solving the equation

$$P_2 = (c'_2)^{-1}(-S_1 - S_2 - c'_1 P_1 - c'_3 P_3 - c'_4 P_4). \qquad (2)$$



In our theoretical analysis we show how to find non-zero
coefficients $c_i$ (that must depend on the parities $P_i$
but are not data dependent) for the alignment condition
to hold. We also show that for the Reed-Solomon
code implemented in HDFS RAID, choosing $c_i = 1 \ \forall i$
and therefore performing simple XOR operations is sufficient.
We further prove that this code has the largest
possible distance (d = 5) for this given locality r = 5
and blocklength n = 16.
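The storage figures of this section can be checked with one line of arithmetic: blocks stored per 10 data blocks for RS (10,4), for the LRC with all three local parities stored, and for the LRC with $S_3$ implied. The last ratio is the roughly 14% extra storage relative to RS quoted in the introduction.

```latex
\[
\frac{14}{10} = 1.4, \qquad
\frac{17}{10} = 1.7, \qquad
\frac{16}{10} = 1.6, \qquad
\frac{16}{14} - 1 \approx 0.143 .
\]
```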

3. SYSTEM DESCRIPTION 

HDFS-RAID is an open source module that imple- 
ments RS encoding and decoding over Apache Hadoop [2] . 
It provides a Distributed Raid File system (DRFS) that 
runs above HDFS. Files stored in DRFS are divided into 
stripes, i.e., groups of several blocks. For each stripe, a 
number of parity blocks are calculated and stored as a 
separate parity file corresponding to the original file. 
HDFS-RAID is implemented in Java (approximately 
12,000 lines of code) and is currently used in produc- 
tion by several organizations, including Facebook. 

The module consists of several components, among 
which RaidNode and BlockFixer are the most relevant
here:

• The RaidNode is a daemon responsible for the cre- 
ation and maintenance of parity files for all data 
files stored in the DRFS. One node in the cluster 
is generally designated to run the RaidNode. The 
daemon periodically scans the HDFS file system 
and decides whether a file is to be RAIDed or not, 
based on its size and age. In large clusters, RAID- 
ing is done in a distributed manner by assigning 
MapReduce jobs to nodes across the cluster. Af- 
ter encoding, the RaidNode lowers the replication 
level of RAIDed files to one. 

• The BlockFixer is a separate process that runs at 
the RaidNode and periodically checks for lost or 
corrupted blocks among the RAIDed files. When 
blocks are tagged as lost or corrupted, the Block- 
Fixer rebuilds them using the surviving blocks of 
the stripe, again, by dispatching repair MapRe- 
duce (MR) jobs. Note that these are not typical 
MR jobs. Implemented under the MR framework, 
repair-jobs exploit its parallelization and schedul- 
ing properties, and can run along regular jobs un- 
der a single control mechanism. 

Both RaidNode and BlockFixer rely on an underlying 
component: ErasureCode. ErasureCode implements the 
erasure encoding/decoding functionality. In Facebook's
HDFS-RAID, an RS (10, 4) erasure code is implemented 
through ErasureCode (4 parity blocks are created for 
every 10 data blocks). 

3.1 HDFS-Xorbas 

Our system, HDFS-Xorbas (or simply Xorbas), is 
a modification of HDFS-RAID that incorporates Lo- 
cally Repairable Codes (LRC). To distinguish it from 
the HDFS-RAID implementing RS codes, we refer to 
the latter as HDFS-RS. In Xorbas, the ErasureCode 
class has been extended to implement LRC on top of 
traditional RS codes. The RaidNode and BlockFixer 
classes were also subject to modifications in order to 
take advantage of the new coding scheme. 

HDFS-Xorbas is designed for deployment in a large- 
scale Hadoop data warehouse, such as Facebook's clus- 
ters. For that reason, our system provides backwards 
compatibility: Xorbas understands both LRC and RS 
codes and can incrementally modify RS encoded files 
into LRCs by adding only local XOR parities. To pro- 
vide this integration with HDFS-RS, the specific LRCs 
we use are designed as extension codes of the (10,4) 
Reed-Solomon codes used at Facebook. First, a file is 
coded using RS code and then a small number of ad- 
ditional local parity blocks are created to provide local 
repairs. 

3.1.1 Encoding 



Once the RaidNode detects a file which is suitable 
for RAIDing (according to parameters set in a config- 
uration file) it launches the encoder for the file. The 
encoder initially divides the file into stripes of 10 blocks 
and calculates 4 RS parity blocks. Depending on the 
size of the file, the last stripe may contain fewer than 
10 blocks. Incomplete stripes are considered as "zero- 
padded" full-stripes as far as the parity calculation is 
concerned.

HDFS-Xorbas computes two extra parities for a total 
of 16 blocks per stripe (10 data blocks, 4 RS parities and 
2 Local XOR parities), as shown in Fig. 2. Similar to
the calculation of the RS parities, Xorbas calculates all 
parity blocks in a distributed manner through MapRe- 
duce encoder jobs. All blocks are spread across the 
cluster according to Hadoop's configured block place- 
ment policy. The default policy randomly places blocks 
at DataNodes, avoiding collocating blocks of the same 
stripe. 
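The striping and local-parity step described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: blocks are held in memory as `bytes`, the four RS parities are omitted, and the helper names are hypothetical rather than taken from the Xorbas code base.

```python
STRIPE_LEN = 10  # data blocks per stripe, as in the text
GROUP_LEN = 5    # data blocks covered by each local XOR parity

def xor_blocks(blocks, size):
    """Bytewise XOR of blocks of the given size."""
    out = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def make_stripes(data_blocks, block_size):
    """Group blocks into stripes of 10, zero-padding the last stripe."""
    zero = bytes(block_size)
    stripes = []
    for start in range(0, len(data_blocks), STRIPE_LEN):
        stripe = list(data_blocks[start:start + STRIPE_LEN])
        stripe += [zero] * (STRIPE_LEN - len(stripe))
        stripes.append(stripe)
    return stripes

def local_parities(stripe, block_size):
    """S1 over X1..X5 and S2 over X6..X10 (RS parities not computed here)."""
    s1 = xor_blocks(stripe[:GROUP_LEN], block_size)
    s2 = xor_blocks(stripe[GROUP_LEN:], block_size)
    return s1, s2

blocks = [bytes([i]) * 4 for i in range(1, 14)]  # a 13-block toy file
stripes = make_stripes(blocks, block_size=4)
print(len(stripes), [len(s) for s in stripes])
```

In the real system the padding only matters for the parity calculation, and the encoding runs as distributed MapReduce jobs rather than in one process.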

3.1.2 Decoding & Repair 

RaidNode starts a decoding process when corrupt 
files are detected. Xorbas uses two decoders: the light- 
decoder aimed at single block failures per stripe, and the 
heavy-decoder, employed when the light-decoder fails. 

When the BlockFixer detects a missing (or corrupted) 
block, it determines the 5 blocks required for the re- 
construction according to the structure of the LRC. 
A special MapReduce job is dispatched to attempt light-decoding:
a single map task opens parallel streams to
the nodes containing the required blocks, downloads 
them, and performs a simple XOR. In the presence of 
multiple failures, the 5 required blocks may not be avail- 
able. In that case the light-decoder fails and the heavy 
decoder is initiated. The heavy decoder operates in the 
same way as in Reed-Solomon: streams to all the blocks 
of the stripe are opened and decoding is equivalent to 
solving a system of linear equations. The RS linear 
system has a Vandermonde structure [31] which allows 
small CPU utilization. The recovered block is finally 
sent to and stored at a DataNode according to the cluster's
block placement policy.

In the currently deployed HDFS-RS implementation, 
even when a single block is corrupt, the BlockFixer 
opens streams to all 13 other blocks of the stripe (which 
could be reduced to 10 with a more efficient implementation).
The benefit of Xorbas should therefore be clear:
for all the single block failures and also many double
block failures (as long as the two missing blocks belong
to different local XORs), the network and disk I/O overheads
will be significantly smaller.
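The choice between the two decoders can be sketched as a small lookup over the local groups of Figure 2. The block labels follow the figure; the functions themselves are hypothetical and only mirror the decision logic described above, not the actual BlockFixer code.

```python
# Local groups of the (10,6,5) LRC; names follow Figure 2.
DATA_1 = ["X1", "X2", "X3", "X4", "X5"]
DATA_2 = ["X6", "X7", "X8", "X9", "X10"]
RS_PARITIES = ["P1", "P2", "P3", "P4"]

def light_repair_set(missing):
    """The 5 blocks the light-decoder needs to rebuild one missing block."""
    if missing in DATA_1:
        return [b for b in DATA_1 if b != missing] + ["S1"]
    if missing == "S1":
        return list(DATA_1)
    if missing in DATA_2:
        return [b for b in DATA_2 if b != missing] + ["S2"]
    if missing == "S2":
        return list(DATA_2)
    if missing in RS_PARITIES:
        # The implied parity S3 is derived from the stored S1 and S2.
        return [b for b in RS_PARITIES if b != missing] + ["S1", "S2"]
    raise ValueError(f"unknown block {missing!r}")

def choose_decoder(missing, unavailable):
    """Return ('light', blocks) if all 5 needed blocks are available,
    otherwise ('heavy', None) to fall back to full RS decoding."""
    needed = light_repair_set(missing)
    if any(b in unavailable for b in needed):
        return "heavy", None
    return "light", needed

print(choose_decoder("X3", unavailable={"X3"}))
print(choose_decoder("X3", unavailable={"X3", "X4"}))
print(choose_decoder("X3", unavailable={"X3", "X8"}))
```

The third call shows the double-failure case mentioned in the text: X3 and X8 belong to different local XORs, so X3 can still be repaired by the light-decoder.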

4. RELIABILITY ANALYSIS 

In this section, we provide a reliability analysis by estimating
the mean-time to data loss (MTTDL) using a
standard Markov model. We use the above metric and
model to compare RS codes and LRCs to replication. 
There are two main factors that affect the MTTDL: i) 
the number of block failures that we can tolerate before 
losing data and ii) the speed of block repairs. It should 
be clear that the MTTDL increases as the resiliency 
to failures increases and the time of block repairs de- 
creases. In the following, we explore the interplay of 
these factors and their effect on the MTTDL. 

When comparing the various schemes, replication of- 
fers the fastest repair possible at the cost of low failure 
resiliency. On the other hand, RS codes and LRCs can 
tolerate more failures, while requiring comparatively 
higher repair times, with the LRC requiring less repair
time than RS. In [9], the authors report values
from Google clusters (cells) and show that, for their pa- 
rameters, a (9,4)-RS code has approximately six orders 
of magnitude higher reliability than 3-way replication. 
Similarly here, we see how coding outperforms replica- 
tion in terms of the reliability metric of interest. 

Along with [9], there exists significant work towards
analyzing the reliability of replication, RAID storage [32],
and erasure codes [11]. The main body of the above
literature considers standard Markov models to analyt- 
ically derive the MTTDL for the various storage settings 
considered. Consistent with the literature, we employ a 
similar approach to evaluate the reliability in our comparisons.
The values obtained here may not be meaningful
in isolation but are useful for comparing the various
schemes (see also

In our analysis, the total cluster data is denoted by
C and S denotes the stripe size. We set the number of
disk nodes to be N = 3000, while the total data stored
is set to be C = 30 PB. The mean time to failure of a
disk node is set at 4 years ($= 1/\lambda$), and the block size
is B = 256 MB (the default value at Facebook's warehouses).
Based on Facebook's cluster measurements,
we limit the cross-rack communication to $\gamma = 1$ Gbps
for repairs. This limit is imposed to model the real
cross-rack communication bandwidth limitations of the 
Facebook cluster. In our case, the cross-rack communi- 
cation is generated due to the fact that all coded blocks 
of a stripe are placed in different racks to provide higher 
fault tolerance. This means that when repairing a sin- 
gle block, all downloaded blocks that participate in its 
repair are communicated across different racks. 

Under 3-way replication, each stripe consists of three 
blocks corresponding to the three replicas, and thus the 
total number of stripes in the system is C/nB where 
n = 3. When RS codes or LRC is employed, the stripe 
size varies according to the code parameters k and n—k. 
For comparison purposes, we consider equal data stripe 
size k = 10. Thus, the number of stripes is C/nB,
where n = 14 for (10, 4) RS and n = 16 for (10, 6, 5)- 
LRC. For the above values, we compute the MTTDL 



of a single stripe ($\text{MTTDL}_{\text{stripe}}$). Then, we normalize
the previous with the total number of stripes to get the
MTTDL of the system, which is calculated as

$$\text{MTTDL} = \frac{\text{MTTDL}_{\text{stripe}}}{C/nB}. \qquad (3)$$
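To make the normalization concrete, the sketch below evaluates the number of stripes C/nB for the three schemes using the parameter values stated above. Interpreting PB and MB as binary units is an assumption of this illustration, and the function names are hypothetical.

```python
PB = 2**50  # assumption: binary petabyte
MB = 2**20  # assumption: binary megabyte

C = 30 * PB   # total data stored
B = 256 * MB  # block size

def num_stripes(n):
    """Number of stripes C/(nB) for a scheme with n blocks per stripe."""
    return C / (n * B)

def system_mttdl(stripe_mttdl, n):
    """Normalize the stripe MTTDL by the number of stripes, as in Eq. (3)."""
    return stripe_mttdl / num_stripes(n)

for name, n in [("3-replication", 3), ("RS (10,4)", 14), ("LRC (10,6,5)", 16)]:
    print(f"{name:14s} stripes = {num_stripes(n):.3e}")
```

A scheme with more blocks per stripe has fewer stripes for the same C, so the same stripe MTTDL translates into a slightly higher system MTTDL.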

Next, we explain how to compute the MTTDL of a 
stripe, for which we use a standard Markov model. The 
number of lost blocks at each time is used to denote
the different states of the Markov chain. The failure and 
repair rates correspond to the forward and backward 
rates between the states. When we employ 3-way replication,
data loss occurs after 3 block erasures.
For both the (10,4)-RS and (10, 6, 5)-LRC schemes, 5
block erasures lead to data loss. Hence, the Markov
chains for the above storage scenarios will have a total
of 3, 5, and 5 states, respectively. In Fig. 3 we show the
corresponding Markov chain for the (10,4)-RS and the 
(10, 6, 5)-LRC. We note that although the chains have 
the same number of states, the transition probabilities 
will be different, depending on the coding scheme. 

We continue by calculating the transition rates. Inter- 
failure times are assumed to be exponentially distributed. 
The same goes for the repair (backward) times. In gen- 
eral, the repair times may not exhibit an exponential 
behavior, however, such an assumption simplifies our 
analysis. When there are i blocks remaining in a stripe
(i.e., when the state is n - i), the rate at which a block is
lost will be $\lambda_i = i\lambda$ because the i blocks are distributed
into different nodes and each node fails independently at
rate $\lambda$. The rate at which a block is repaired depends on
how many blocks need to be downloaded for the repair,
the block size, and the download rate $\gamma$. For example,
for the 3-replication scheme, single block repairs require
downloading one block, hence we assume $\rho_i = \gamma/B$, for
i = 1, 2. For the coded schemes, we additionally consider
the effect of using heavy or light decoders. For
example in the LRC, if two blocks are lost from the 
same stripe, we determine the probabilities for invoking 
light or heavy decoder and thus compute the expected 
number of blocks to be downloaded. We skip a detailed 
derivation due to lack of space. For a similar treatment, 
see [9]. The stripe MTTDL equals the average time it
takes to go from state 0 to the "data loss state". Under
the above assumptions and transition rates, we calculate
the MTTDL of the stripe from which the MTTDL
of the system can be calculated using Eq. (3).
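The expected time to reach the data loss state in such a birth-death chain has a simple recursive form: the expected time to first move from i lost blocks to i+1 lost blocks is $1/f_i + (r_i/f_i)$ times the corresponding time for state i-1, where $f_i$ and $r_i$ are the failure and repair rates out of state i. The sketch below implements this recursion for the 3-replication case. The numeric parameters are illustrative conversions of the values in the text, the function name is hypothetical, and the output is not meant to reproduce the paper's reported figures.

```python
def stripe_mttdl(fail_rates, repair_rates):
    """Expected time to go from state 0 to the data-loss state.

    fail_rates[i]   : rate of moving from i lost blocks to i+1
    repair_rates[i] : rate of moving from i lost blocks back to i-1
                      (repair_rates[0] is ignored)
    Data loss is the state reached after the last failure transition.
    """
    total = 0.0
    step = 0.0  # expected time to first reach state i+1 from state i
    for i, f in enumerate(fail_rates):
        r = repair_rates[i] if i > 0 else 0.0
        step = 1.0 / f + (r / f) * step
        total += step
    return total

# Illustrative parameters; unit conversions are assumptions of this sketch.
lam = 1 / (4 * 365.0)      # node failure rate per day (MTTF of 4 years)
gamma = 1e9 / 8 * 86400    # 1 Gbps expressed in bytes per day
B = 256 * 2**20            # block size in bytes

n, tolerated = 3, 2        # 3-way replication: loss at the third erasure
fail = [(n - i) * lam for i in range(tolerated + 1)]
repair = [0.0] + [gamma / B] * tolerated  # one block downloaded per repair
print(f"toy stripe MTTDL: {stripe_mttdl(fail, repair):.3e} days")
```

For the coded schemes the repair rates would be divided by the expected number of blocks downloaded per repair, which depends on how often the light or heavy decoder is invoked.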

The MTTDL values that we calculated for replica- 
tion, HDFS-RS, and Xorbas, under the Markov model 
considered, are shown in Table 1. We observe that
the higher repair speed of LRC compensates for the
additional storage in terms of reliability. This gives the
Xorbas LRC (10,6,5) two more zeros of reliability compared
to a (10,4) Reed-Solomon code. The reliability of
the 3-replication is substantially lower than both coded 
schemes, similar to what has been observed in related 








Figure 3: The Markov model used to calculate
the MTTDL

Scheme          Storage overhead   Repair traffic   MTTDL (days)
3-replication   2x                 1x               2.3079E+10
RS (10,4)       0.4x               10x              3.3118E+13
LRC (10,6,5)    0.6x               5x               1.2180E+15



Table 1: Comparison summary of the three 
schemes. MTTDL assumes independent node 
failures. 

studies [9].

Another interesting metric is data availability. Avail- 
ability is the fraction of time that data is available for 
use. Note that in the case of 3-replication, if one block is 
lost, then one of the other copies of the block is immedi- 
ately available. On the contrary, for either RS or LRC, 
a job requesting a lost block must wait for the comple- 
tion of the repair job. Since LRCs complete these jobs 
faster, they will have higher availability due to these 
faster degraded reads. A detailed study of availability 
tradeoffs of coded storage systems remains an interest- 
ing future research direction. 

5. EVALUATION 

In this section, we provide details on a series of ex- 
periments we performed to evaluate the performance of 
HDFS-Xorbas in two environments: Amazon's Elastic
Compute Cloud (EC2) [1] and a test cluster at Facebook.

5.1 Evaluation Metrics 

We rely primarily on the following metrics to evaluate 
HDFS-Xorbas against HDFS-RS: HDFS Bytes Read, 
Network Traffic, and Repair Duration. HDFS Bytes 
Read corresponds to the total amount of data read by 
the jobs initiated for repair. It is obtained by aggregat- 
ing partial measurements collected from the statistics- 
reports of the jobs spawned following a failure event. 
Network Traffic represents the total amount of data 
communicated from nodes in the cluster (measured in 
GB). Since the cluster does not handle any external 
traffic, Network Traffic is equal to the amount of data 
moving into nodes. It is measured using Amazon's AWS
CloudWatch monitoring tools. Repair Duration is simply
calculated as the time interval between the starting
time of the first repair job and the ending time of the
last repair job.
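The Repair Duration metric can be sketched as follows. This is a minimal illustration that assumes each repair job is summarized by a (start, end) pair of timestamps in seconds; the function name and the sample timestamps are hypothetical and do not come from the experiments.

```python
def repair_duration(jobs):
    """Interval between the start of the first repair job and the end of the last.

    jobs: iterable of (start, end) timestamps in seconds, one pair per repair job.
    """
    jobs = list(jobs)
    if not jobs:
        raise ValueError("at least one repair job is required")
    first_start = min(start for start, _ in jobs)
    last_end = max(end for _, end in jobs)
    return last_end - first_start


# Hypothetical job intervals spawned after one failure event.
example_duration = repair_duration([(10, 50), (20, 90), (15, 40)])
```

Note that overlapping jobs are not double counted: only the earliest start and the latest end matter.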

5.2 Amazon EC2 

On EC2, we created two Hadoop clusters, one running
HDFS-RS and the other HDFS-Xorbas. Each cluster
consisted of 51 instances of type m1.small, which
corresponds to a 32-bit machine with 1.7 GB mem- 
ory, 1 compute unit and 160 GB of storage, running 
Ubuntu/Linux-2.6.32. One instance in each cluster served 
as a master, hosting Hadoop's NameNode, JobTracker 
and RaidNode daemons, while the remaining 50 in- 
stances served as slaves for HDFS and MapReduce, each 
hosting a DataNode and a TaskTracker daemon, thereby 
forming a Hadoop cluster of total capacity roughly equal 
to 7.4 TB. Unfortunately, no information is provided by 
EC2 on the topology of the cluster. 

The clusters were initially loaded with the same amount 
of logical data. Then a common pattern of failure events 
was triggered manually in both clusters to study the dy- 
namics of data recovery. The objective was to measure 
key properties such as the number of HDFS Bytes Read 
and the real Network Traffic generated by the repairs. 

All files used were of size 640 MB. With the block size
configured to 64 MB, each file yields a single stripe
with 14 and 16 full-size blocks in HDFS-RS and
HDFS-Xorbas, respectively. This choice
is representative of the majority of stripes in a produc- 
tion Hadoop cluster: extremely large files are split into 
many stripes, so in total only a small fraction of the 
stripes will have a smaller size. In addition, it allows us 
to better predict the total amount of data that needs 
to be read in order to reconstruct missing blocks and 
hence interpret our experimental results. Finally, since 
block repair depends only on blocks of the same stripe, 
using larger files that would yield more than one stripe 
would not affect our results. An experiment involving
arbitrary file sizes is discussed in Section 5.3.
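The block counts quoted above follow from simple arithmetic, sketched here. The helper name is hypothetical; the parity counts (4 for the (10,4) RS code, 6 for the (10,6,5) LRC) are read off the code parameters used in the text, and the helper assumes the file fits in a single stripe.

```python
import math


def blocks_per_stripe(file_mb, block_mb, parity_blocks):
    """Total blocks in a single-stripe file: data blocks plus parity blocks."""
    data_blocks = math.ceil(file_mb / block_mb)
    return data_blocks + parity_blocks


rs_blocks = blocks_per_stripe(640, 64, 4)    # 10 data + 4 RS parities
lrc_blocks = blocks_per_stripe(640, 64, 6)   # 10 data + 6 LRC parities
extra_blocks = lrc_blocks / rs_blocks - 1    # relative extra storage of the LRC
```

The ratio 16/14 gives roughly 14.3% more blocks for HDFS-Xorbas, which is also the expected surplus of lost blocks under a random failure.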

During the course of a single experiment, once all files
were RAIDed, a total of eight failure events were triggered
in each cluster. A failure event consists of the
termination of one or more DataNodes. In our failure
pattern, the first four failure events consisted of single
DataNode terminations, the next two were terminations
of triplets of DataNodes and finally two terminations
of pairs of DataNodes. Upon a failure event,
MapReduce repair jobs are spawned by the RaidNode
to restore missing blocks. Sufficient time was provided
for both clusters to complete the repair process, allowing
measurements corresponding to distinct events to
be isolated. For example, events are distinct in Fig. 4.
Note that the DataNodes selected for termination stored
roughly the same number of blocks for both clusters.
The objective was to compare the two systems for the
repair cost per block lost. However, since Xorbas has
an additional storage overhead, a random failure event
would, in expectation, lead to the loss of 14.3% more blocks
in Xorbas compared to RS. In any case, results can be
adjusted to take this into account, without significantly
affecting the gains observed in our experiments.

Figure 4: The metrics measured during the 200 file experiment:
(a) HDFS Bytes Read per failure event, (b) Network Out Traffic
per failure event, (c) Repair Duration per failure event.
Network-in is similar to Network-out and so it is not displayed
here. During the course of the experiment, we simulated eight
failure events; the x-axis gives the number of DataNodes
terminated during each failure event, with the number of blocks
lost displayed in parentheses.

In total, three experiments were performed on the
above setup, successively increasing the number of files
stored (50, 100, and 200 files), in order to understand
the impact of the amount of data stored on system
performance. Fig. 4 depicts the measurement from the last
case, while the other two produce similar results. The
measurements of all the experiments are combined in
Fig. 6, plotting HDFS Bytes Read, Network Traffic
and Repair Duration versus the number of blocks lost,
for all three experiments carried out in EC2. We also
plot the linear least squares fitting curve for these
measurements.

5.2.1 HDFS Bytes Read 

Fig. 4a depicts the total number of HDFS bytes read
by the BlockFixer jobs initiated during each failure event.
The bar plots show that HDFS-Xorbas reads 41%-52% of
the amount of data that RS reads to reconstruct the
same number of lost blocks. These measurements are
consistent with the theoretically expected values, given
that more than one block per stripe is occasionally
lost (note that 5/12.14 = 41%). Fig. 6a shows that the
number of HDFS bytes read is linearly dependent on
the number of blocks lost, as expected. The slopes give
us the average number of HDFS bytes read per block for
Xorbas and HDFS-RS. The average number of blocks
read per lost block is estimated to be 11.5 for HDFS-RS
and 5.8 for HDFS-Xorbas, showing the 2x benefit of
HDFS-Xorbas.
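The slope estimate described above can be reproduced with an ordinary least squares fit. The sketch below is only an illustration: the function name is hypothetical and the data points are synthetic, generated from an assumed cost of 5.8 block reads of 64 MB per lost block rather than taken from the measurements.

```python
def least_squares_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


BLOCK_MB = 64
# Synthetic points (NOT measurements): lost blocks versus MB read during repair.
blocks_lost = [50, 100, 200]
mb_read = [5.8 * BLOCK_MB * b for b in blocks_lost]

slope, intercept = least_squares_line(blocks_lost, mb_read)
blocks_read_per_lost_block = slope / BLOCK_MB
```

Dividing the fitted slope by the block size converts bytes read per lost block into the number of blocks read per lost block, which is the quantity compared between the two systems.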

5.2.2 Network Traffic 



Fig. 4b depicts the network traffic produced by BlockFixer
jobs during the entire repair procedure. In particular,
it shows the outgoing network traffic produced
in the cluster, aggregated across instances. Incoming
network traffic is similar since the cluster only communicates
information internally. In Fig. 5a, we present the
Network Traffic plotted continuously during the course
of the 200 file experiment, with a 5-minute resolution.
The sequence of failure events is clearly visible. Throughout
our experiments, we consistently observed that network
traffic was roughly equal to twice the number of
bytes read. Therefore, gains in the number of HDFS
bytes read translate to network traffic gains, as expected.

5.2.3 Repair Time 

Fig. 4c depicts the total duration of the recovery procedure,
i.e., the interval from the launch time of the first
block fixing job to the termination of the last one.
Combining measurements from all the experiments, Fig. 6c
shows the repair duration versus the number of blocks 
repaired. These figures show that Xorbas finishes 25% 
to 45% faster than HDFS-RS. 

The fact that the traffic peaks of the two systems are 
different is an indication that the available bandwidth 
was not fully saturated in these experiments. However, 
it is consistently reported that the network is typically 
the bottleneck for large-scale MapReduce tasks [5, 14, 15].
Similar behavior is observed in the Facebook production
cluster at large-scale repairs. This is because
hundreds of machines can share a single top-level switch 
which becomes saturated. Therefore, since LRC trans- 
fers significantly less data, we expect network saturation 
to further delay RS repairs in larger scale and hence give 
higher recovery time gains of LRC over RS. 

From the CPU Utilization plots we conclude that
HDFS-RS and Xorbas have very similar CPU requirements
and this does not seem to influence the repair
times.

Figure 7: Completion times of 10 WordCount
jobs: encountering no block missing, and ~20%
of blocks missing on the two clusters. Dotted
lines depict average job completion times.

5.2.4 Repair under Workload 

To demonstrate the impact of repair performance on 
the cluster's workload, we simulate block losses in a 
cluster executing other tasks. We created two clusters, 
15 slave nodes each. The submitted artificial workload 
consists of word-count jobs running on five identical 
3 GB text files. Each job comprises enough tasks
to occupy all computational slots, while Hadoop's
FairScheduler allocates tasks to TaskTrackers so that
computational time is fairly shared among jobs. Fig. 7 depicts
the execution time of each job under two scenarios: i) 
all blocks are available upon request, and ii) almost 20% 
of the required blocks are missing. Unavailable blocks 
must be reconstructed to be accessed, incurring a de- 
lay in the job completion which is much smaller in the 
case of HDFS-Xorbas. In the conducted experiments 
the additional delay due to missing blocks is more than 
doubled (from 9 minutes for LRC to 23 minutes for RS). 

We note that the benefits depend critically on how 
the Hadoop FairScheduler is configured. If concurrent 
jobs are blocked but the scheduler still allocates slots 
to them, delays can significantly increase. Further, jobs 
that need to read blocks may fail if repair times exceed 
a threshold. In these experiments we set the scheduling 
configuration options in the way most favorable to RS. 
Finally, as previously discussed, we expect that LRCs 
will be even faster than RS in larger-scale experiments 
due to network saturation. 





                   All Blocks   ~20% of blocks missing
                   Avail.       Xorbas      RS
Total Bytes Read   30 GB        43.88 GB    74.06 GB
Avg Job Ex. Time   83 min       92 min      106 min

Table 2: Repair impact on workload.

5.3 Facebook's cluster 

In addition to the series of controlled experiments 
performed over EC2, we performed one more experi- 
ment on Facebook's test cluster. This test cluster con- 
sisted of 35 nodes configured with a total capacity of 
370 TB. Instead of placing files of pre-determined sizes 
as we did in EC2, we utilized the existing set of files in 
the cluster: 3,262 files, totaling approximately 2.7
TB of logical data. The block size used was 256 MB 
(same as in Facebook's production clusters). Roughly 
94% of the files consisted of 3 blocks and the remaining 
of 10 blocks, leading to an average 3.4 blocks per file. 





          Blocks   HDFS GB read       Repair
          Lost     Total    /block    Duration
RS        369      486.6    1.318     26 min
Xorbas    563      330.8    0.58      19 min

Table 3: Experiment on Facebook's Cluster Results.



For our experiment, HDFS-RS was deployed on the
cluster and upon completion of data RAIDing, a random
DataNode was terminated. HDFS Bytes Read and
the Repair Duration measurements were collected.
Unfortunately, we did not have access to Network Traffic
measurements. The experiment was repeated, deploying
HDFS-Xorbas on the same set-up. Results are
shown in Table 3. Note that in this experiment,
HDFS-Xorbas stored 27% more than HDFS-RS (ideally, the
overhead should be 13%), due to the small size of the
majority of the files stored in the cluster. As noted before,
files typically stored in HDFS are large (and small
files are typically archived into large HAR files). Further,
it may be emphasized that the particular dataset
used for this experiment is by no means representative
of the dataset stored in Facebook's production clusters.

In this experiment, the number of blocks lost in the
second run exceeds that of the first run by more than
the storage overhead introduced by HDFS-Xorbas. However,
we still observe benefits in the amount of data read
and repair duration, and the gains are even clearer
when normalizing by the number of blocks lost.
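The normalization mentioned above can be checked directly from the figures reported in Table 3. The sketch below only redoes that arithmetic; the variable names are hypothetical.

```python
# (blocks lost, total HDFS GB read) as reported in Table 3.
runs = {"RS": (369, 486.6), "Xorbas": (563, 330.8)}

# GB read per lost block: about 1.32 for RS and about 0.59 for Xorbas.
gb_per_block = {name: gb / lost for name, (lost, gb) in runs.items()}

# RS reads a bit more than twice as much data per repaired block.
per_block_gain = gb_per_block["RS"] / gb_per_block["Xorbas"]

# The second run lost about 53% more blocks than the first one.
extra_blocks_lost = runs["Xorbas"][0] / runs["RS"][0] - 1
```

Even though the Xorbas run repaired many more blocks, it read less data in total, so the per-block comparison is the more informative one.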

6. RELATED WORK 

Optimizing code designs for efficient repair is a topic
that has recently attracted significant attention due to
its relevance to distributed systems. There is a substantial
volume of work and we only try to give a high-level
overview here. The interested reader can refer to [7]
and references therein.

Figure 5: Measurements in time from the two EC2 clusters during the sequence of failing events.

Figure 6: Measurement points of failure events versus the total number of blocks lost in the
corresponding events. Measurements are from all three experiments.

The first important distinction in the literature is be- 
tween functional and exact repair. Functional repair 
means that when a block is lost, a different block is 
created that maintains the (n, k) fault tolerance of the 
code. The main problem with functional repair is that 
when a systematic block is lost, it will be replaced with 
a parity block. While global fault tolerance to n - k
erasures remains, reading a single block would now re- 
quire access to k blocks. While this could be useful 
for archival systems with rare reads, it is not practical 
for our workloads. Therefore, we are interested only 
in codes with exact repair so that we can maintain the 
code systematic. 

Dimakis et al. [6] showed that it is possible to repair
codes with network traffic smaller than the naive
scheme that reads and transfers k blocks. The first re- 
generating codes [6] provided only functional repair and 
the existence of exact regenerating codes matching the 
information theoretic bounds remained open. 

A substantial volume of work (e.g. [7, 25, 30] and
references therein) subsequently showed that exact repair
is possible, matching the information theoretic bound
of [6]. The code constructions are separated into exact
codes for low rates k/n < 1/2 and high rates k/n > 1/2.
For rates below 1/2 (i.e. storage overheads above 2)
beautiful combinatorial constructions of exact regenerating
codes were recently discovered [26, 29]. Since
replication has a storage overhead of three, for our
applications storage overheads around 1.4-1.8 are of most
interest, which ruled out the use of low rate exact
regenerating codes.
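To make the rate argument concrete, the sketch below computes the rate k/n and the storage overhead n/k of the two codes evaluated in this paper. The helper name is hypothetical; the parameters are the (10,4) RS code and the (10,6,5) LRC.

```python
from fractions import Fraction


def code_rate(k, parity_blocks):
    """Rate k/n of a code with k data blocks and the given number of parities."""
    return Fraction(k, k + parity_blocks)


rs_rate = code_rate(10, 4)     # (10,4) Reed-Solomon
lrc_rate = code_rate(10, 6)    # (10,6,5) LRC

rs_overhead = 1 / rs_rate      # stored blocks per data block
lrc_overhead = 1 / lrc_rate
```

Both rates are above 1/2 (overheads of 1.4 and 1.6), which places these codes in the high-rate regime discussed next.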

For high-rate exact repair, our understanding is currently
incomplete. The problem of existence of such
codes remained open until two groups independently [3]
used Interference Alignment, an asymptotic technique
developed for wireless information theory, to show the
existence of exact regenerating codes at rates above 1/2.
Unfortunately this construction is only of theoretical
interest since it requires exponential field size and performs
well only in the asymptotic regime. Explicit high-rate
regenerating codes are a topic of active research but
no practical construction is currently known to us. A
second related issue is that many of these codes reduce
the repair network traffic but at a cost of higher disk
I/O. It is not currently known if this high disk I/O is a
fundamental requirement or if practical codes with both
small disk I/O and repair traffic exist.

Another family of codes optimized for repair has focused
on relaxing the MDS requirement to improve on
repair disk I/O and network bandwidth (e.g. [17, 20, 10]).
The metric used in these constructions is locality,
the number of blocks that need to be read to reconstruct
a lost block. The codes we introduce are optimal
in terms of locality and match the bound shown in [10].
In our recent prior work [23] we generalized this bound
and showed that it is information theoretic (i.e. holds
also for vector linear and non-linear codes). We note
that optimal locality does not necessarily mean optimal
disk I/O or optimal network repair traffic and the
fundamental connections of these quantities remain open.

The main theoretical innovation of this paper is a 
novel code construction with optimal locality that relies 
on Reed-Solomon global parities. We show how the 
concept of implied parities can save storage and show 
how to explicitly achieve parity alignment if the global 
parities are Reed-Solomon. 

7. CONCLUSIONS 

Modern storage systems are transitioning to erasure 
coding. We introduced a new family of codes called 
Locally Repairable Codes (LRCs) that have marginally 
suboptimal storage but significantly smaller repair disk 
I/O and network bandwidth requirements. In our im- 
plementation, we observed 2x disk I/O and network 
reduction for the cost of 14% more storage, a price that 
seems reasonable for many scenarios. 

One related area where we believe locally repairable 
codes can have a significant impact is purely archival 
clusters. In this case we can deploy large LRCs (i.e., 
stripe sizes of 50 or 100 blocks) that can simultaneously
offer high fault tolerance and small storage overhead. 
This would be impractical if Reed-Solomon codes are 
used since the repair traffic grows linearly in the stripe 
size. Local repairs would further allow spinning disks 
down [21] since very few are required for single block 
repairs. 

In conclusion, we believe that LRCs create a new
operating point that will be practically relevant in
large-scale storage systems, especially when the network
bandwidth is the main performance bottleneck.

APPENDIX

A. DISTANCE AND LOCALITY THROUGH 
ENTROPY 

In the following, we use a characterization of the code 
distance d of a length n code that is based on the en- 
tropy function. This characterization is universal in the 
sense that it covers any linear or nonlinear code designs. 

Let x be a file of size M that we wish to split and
store with redundancy k/n in n blocks, where each block
has size M/k. Without loss of generality, we assume
that the file is split in k blocks of the same size
$x = [X_1 \ldots X_k] \in \mathbb{F}^{1 \times k}$, where $\mathbb{F}$ is the finite field over which
all operations are performed. The entropy of each file
block is $H(X_i) = \frac{M}{k}$, for all $i \in [k]$, where $[n] = \{1, \ldots, n\}$.$^{2}$
Then, we define an encoding (generator)
map $G : \mathbb{F}^{1 \times k} \mapsto \mathbb{F}^{1 \times n}$ that takes as input the k file
blocks and outputs n coded blocks $G(x) = y = [Y_1 \ldots Y_n]$,
where $H(Y_i) = \frac{M}{k}$, for all $i \in [n]$. The encoding function
G defines a (k, n-k) code $\mathcal{C}$ over the vector space
$\mathbb{F}^{1 \times n}$. We can calculate the effective rate of the code as
the ratio of the entropy of the file blocks to the sum of
the entropies of the n coded blocks

$$R = \frac{H(X_1, \ldots, X_k)}{\sum_{i=1}^{n} H(Y_i)} = \frac{k}{n}. \qquad (4)$$



The distance d of the code $\mathcal{C}$ is equal to the minimum
number of erasures of blocks in y after which the
entropy of the remaining blocks is strictly less than M

$$d = \min_{H(\{Y_1, \ldots, Y_n\} \setminus \mathcal{E}) < M} |\mathcal{E}| = n - \max_{H(\mathcal{S}) < M} |\mathcal{S}|, \qquad (5)$$

where $\mathcal{E} \in 2^{\{Y_1, \ldots, Y_n\}}$ is a block erasure pattern set and
$2^{\{Y_1, \ldots, Y_n\}}$ denotes the power set of $\{Y_1, \ldots, Y_n\}$, i.e.,
the set that consists of all subsets of $\{Y_1, \ldots, Y_n\}$. Hence,
for a code $\mathcal{C}$ of length n and distance d, any n - d + 1
coded blocks can reconstruct the file, i.e., have joint
entropy at least equal to M. It follows that when d is
given, n - d is the maximum number of coded variables
that have entropy less than M.
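For small linear codes, the entropy-based definition of distance can be evaluated by brute force. The sketch below is an illustration under stated assumptions: the code is linear over GF(2), each coded block is represented by its generator column packed into an integer bit mask, and the joint entropy of a set of blocks is its rank times M/k. The function names and the two toy codes are not from the paper.

```python
from itertools import combinations


def gf2_rank(vectors):
    """Rank over GF(2) of vectors given as integer bit masks."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)  # clear the leading bit of b from v if it is set
        if v:
            basis.append(v)
    return len(basis)


def erasure_distance(columns, k):
    """Smallest number of erased blocks that leaves rank below k."""
    if k < 1:
        raise ValueError("k must be at least 1")
    n = len(columns)
    for erased in range(1, n + 1):
        for pattern in combinations(range(n), erased):
            kept = [columns[j] for j in range(n) if j not in pattern]
            if gf2_rank(kept) < k:
                return erased


# Toy codes: a single parity check code (k=2, n=3) and a systematic
# [7,4] Hamming code; bit i of a column is the coefficient of X_{i+1}.
parity_code = [0b01, 0b10, 0b11]
hamming_code = [0b0001, 0b0010, 0b0100, 0b1000, 0b1011, 0b1101, 0b1110]

d_parity = erasure_distance(parity_code, 2)
d_hamming = erasure_distance(hamming_code, 4)
```

The search is exponential in n, so it is only meant for checking tiny examples against the definition in (5).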



2 Equivalently, each block can be considered as a random
variable that has entropy M/k.

The locality r of a code can also be defined in terms of
coded block entropies. When a coded block $Y_i$, $i \in [n]$,
has locality r, then it is a function of r other coded
variables $Y_i = f_i(Y_{\mathcal{R}(i)})$, where $\mathcal{R}(i)$ indexes the set of
r blocks $Y_j$, $j \in \mathcal{R}(i)$, that can reconstruct $Y_i$, and
$f_i$ is some function (linear or nonlinear) on these r
coded blocks. Hence, the entropy of $Y_i$ conditioned
on its repair group $\mathcal{R}(i)$ is identically equal to zero,
$H(Y_i | f_i(Y_{\mathcal{R}(i)})) = 0$, for $i \in [n]$. This functional dependency
of $Y_i$ on the blocks in $\mathcal{R}(i)$ is fundamentally the
only code structure that we assume in our derivations.$^{3}$
This generality is key to providing universal information
theoretic bounds on the code distance of (k, n-k)
linear, or nonlinear, codes that have locality r. Our following
bounds can be considered as generalizations of
the Singleton Bound on the code distance when locality
is taken into account.

B. INFORMATION THEORETIC LIMITS OF 
LOCALITY AND DISTANCE 

We consider (k, n-k) codes that have block locality
r. We find an upper bound on the distance by lower
bounding the largest set $\mathcal{S}$ of coded blocks whose entropy
is less than M, i.e., a set that cannot reconstruct
the file. Effectively, we solve the following optimization
problem that needs to be performed over all possible
codes $\mathcal{C}$ and yields a best-case minimum distance

$$\min_{\mathcal{C}} \max_{\mathcal{S}} |\mathcal{S}| \quad \text{s.t.:} \quad H(\mathcal{S}) < M, \ \mathcal{S} \in 2^{\{Y_1, \ldots, Y_n\}}.$$

We are able to provide a bound by considering a single
property: each block is a member of a repair group of
size r + 1.

Definition 3. For a code $\mathcal{C}$ of length n and locality
r, a coded block $Y_i$ along with the blocks that can generate
it, $Y_{\mathcal{R}(i)}$, form a repair group $\Gamma(i) = \{i, \mathcal{R}(i)\}$, for
all $i \in [n]$. We refer to these repair groups as (r+1)-groups.

It is easy to check that the joint entropy of the blocks in
a single (r+1)-group is at most as much as the entropy
of r file blocks

$$H(Y_{\Gamma(i)}) = H(Y_i, Y_{\mathcal{R}(i)}) = H(Y_{\mathcal{R}(i)}) + H(Y_i | Y_{\mathcal{R}(i)}) = H(Y_{\mathcal{R}(i)}) \le \sum_{j \in \mathcal{R}(i)} H(Y_j) = r \frac{M}{k},$$

for all $i \in [n]$. To determine the upper bound on the minimum
distance of $\mathcal{C}$, we construct the maximum set of
coded blocks $\mathcal{S}$ that has entropy less than M. We use
this set to derive the following theorem.



3 In the following, we consider codes with uniform locality,
i.e., (k, n-k) codes where all encoded blocks have locality r.
These codes are referred to as non-canonical codes in [10].



Theorem 2. For a code $\mathcal{C}$ of length n, where each
coded block has entropy $\frac{M}{k}$ and locality r, the minimum
distance is bounded as

$$d \le n - \left\lceil \frac{k}{r} \right\rceil - k + 2. \qquad (6)$$
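The bound in (6) is easy to evaluate numerically. The sketch below uses a hypothetical helper name and two illustrative parameter sets: a (10,4) RS code viewed as having locality r = k, and a toy code with k = 4, n = 6, r = 2 that is not taken from the paper.

```python
def locality_distance_bound(n, k, r):
    """Upper bound n - ceil(k / r) - k + 2 on the minimum distance."""
    if not (1 <= r <= k <= n):
        raise ValueError("expected 1 <= r <= k <= n")
    ceil_k_over_r = -(-k // r)  # integer ceiling of k / r
    return n - ceil_k_over_r - k + 2


# With r = k the bound reduces to the Singleton bound n - k + 1.
rs_bound = locality_distance_bound(n=14, k=10, r=10)

# Toy parameters (hypothetical): k = 4 file blocks, n = 6, locality r = 2.
toy_bound = locality_distance_bound(n=6, k=4, r=2)
```

Lowering r increases the term ceil(k/r), so better locality is paid for with a smaller achievable distance at fixed n and k.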



Proof: Our proof follows the same steps as the one in
[10]. We start by building the set $\mathcal{S}$ in steps and denote
the collection of coded blocks at each step as $\mathcal{S}_i$. The
algorithm that builds the set is in Fig. 8. The goal is
to lower bound the cardinality of $\mathcal{S}$, which results in an
upper bound on code distance d, since $d \le n - |\mathcal{S}|$. At
each step we denote the difference in cardinality of $\mathcal{S}_i$
and $\mathcal{S}_{i-1}$ and the difference in entropy as
$s_i = |\mathcal{S}_i| - |\mathcal{S}_{i-1}|$ and $h_i = H(\mathcal{S}_i) - H(\mathcal{S}_{i-1})$, respectively.



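step
1   Set $\mathcal{S}_0 = \emptyset$ and $i = 1$
2   WHILE $H(\mathcal{S}_{i-1}) < M$
3     Pick a coded block $Y_j \notin \mathcal{S}_{i-1}$
4     IF $H(\mathcal{S}_{i-1} \cup \{Y_{\Gamma(j)}\}) < M$
5       set $\mathcal{S}_i = \mathcal{S}_{i-1} \cup Y_{\Gamma(j)}$
6     ELSE IF $H(\mathcal{S}_{i-1} \cup \{Y_{\Gamma(j)}\}) \ge M$
7       pick $\mathcal{Y}_s \subset Y_{\Gamma(j)}$ s.t. $H(\mathcal{Y}_s \cup \mathcal{S}_{i-1}) < M$
8       set $\mathcal{S}_i = \mathcal{S}_{i-1} \cup \mathcal{Y}_s$
9     $i = i + 1$

Figure 8: The algorithm that builds set $\mathcal{S}$.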
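The procedure of Fig. 8 can be illustrated on a small linear code. The sketch below is a simplified greedy variant, not the exact procedure of the proof: it assumes a linear code over GF(2) with entropy measured as rank (in units of M/k), it visits the repair groups in the order given, and it stops after the first group that cannot be added in full. All names and the toy code are hypothetical.

```python
def gf2_rank(vectors):
    """Rank over GF(2) of vectors given as integer bit masks."""
    basis = []
    for v in vectors:
        for b in basis:
            v = min(v, v ^ b)
        if v:
            basis.append(v)
    return len(basis)


def build_low_entropy_set(columns, groups, k):
    """Greedily collect block indices whose joint rank stays below k."""
    chosen = []
    for group in groups:
        new = [j for j in group if j not in chosen]
        if gf2_rank([columns[j] for j in chosen + new]) < k:
            chosen += new  # whole group fits (line 5 of Fig. 8)
        else:
            # Group would reach full entropy: add a subset only (lines 7-8).
            for j in new:
                if gf2_rank([columns[t] for t in chosen + [j]]) < k:
                    chosen.append(j)
            break
    return chosen


# Toy code (hypothetical): k = 4, two groups of size r + 1 = 3.
# Blocks: X1, X2, X1+X2, X3, X4, X3+X4.
columns = [0b0001, 0b0010, 0b0011, 0b0100, 0b1000, 0b1100]
groups = [[0, 1, 2], [3, 4, 5]]

low_entropy_set = build_low_entropy_set(columns, groups, k=4)
distance_upper_bound = len(columns) - len(low_entropy_set)
```

For this toy code the set has four blocks, giving d <= 6 - 4 = 2, which coincides with the value of the bound in Theorem 2 for n = 6, k = 4, r = 2.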

At each step (depending on the possibility that two
(r+1)-groups overlap) the difference in cardinalities $s_i$
is bounded as $1 \le s_i \le r + 1$, that is, $s_i = r + 1 - p$,
where $|\{Y_{\Gamma(j)}\} \cap \mathcal{S}_{i-1}| = p$. Now there exist two possible
cases. First, the case where the last step set $\mathcal{S}_l$ is
generated by line 5. For this case we can also bound
the entropy as $h_i \le (s_i - 1)\frac{M}{k} \Leftrightarrow s_i \ge \frac{k}{M} h_i + 1$, which
comes from the fact that at least one coded variable
in $\{Y_{\Gamma(j)}\}$ is a function of variables in $\mathcal{S}_{i-1} \cup Y_{\mathcal{R}(j)}$.
Now, we can bound the cardinality

$$|\mathcal{S}_l| = \sum_{i=1}^{l} s_i \ge \sum_{i=1}^{l} \left( \frac{k h_i}{M} + 1 \right) = l + \frac{k}{M} \sum_{i=1}^{l} h_i.$$

We now have to bound $l$ and $\sum_{i=1}^{l} h_i$. First, observe that since $l$ is
our "last step," this means that the aggregate entropy
in $\mathcal{S}_l$ should be less than the file size, i.e., it should
have a value $M - c \cdot \frac{M}{k}$, for $0 < c \le 1$. If $c > 1$
then we could collect another variable in that set. On
the other hand, if $c = 0$, then the coded blocks in
$\mathcal{S}_l$ would have been sufficient to reconstruct the file.
Hence, $M - \frac{M}{k} \le \sum_{i=1}^{l} h_i < M$. We shall now lower
bound $l$. The smallest $l' \le l$ (i.e., the fastest) upon
which $\mathcal{S}_{l'}$ reaches an aggregate entropy that is greater
than, or equal to M, can be found in the following way:
if we could only collect (r+1)-groups of entropy $r\frac{M}{k}$,
without "entropy losses" between these groups, i.e., if
there were no further dependencies than the ones dictated
by locality, then we would stop just before $\mathcal{S}_{l'}$
reached an entropy of M, that is
$\sum_{i=1}^{l'} h_i < M \Leftrightarrow l' r \frac{M}{k} < M \Leftrightarrow l' < \left\lceil \frac{k}{r} \right\rceil$.
However, $l'$ is an integer, hence
$l' = \left\lceil \frac{k}{r} \right\rceil - 1$. We apply the above to bound the cardinality
$|\mathcal{S}_l| \ge k - 1 + l' \ge k - 1 + \left\lceil \frac{k}{r} \right\rceil - 1 = k + \left\lceil \frac{k}{r} \right\rceil - 2$,
in which case we obtain $d \le n - \left\lceil \frac{k}{r} \right\rceil - k + 2$.

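We move to the second case where we reach line 6 of
the building algorithm: the entropy of the file can be
covered only by collecting (r+1)-groups. This depends
on the remainder of the division of $M$ by $r\frac{M}{k}$. Posterior
to collecting the (r+1)-groups, we are left with some
entropy that needs to be covered by at most r - 1 additional
blocks not in $\mathcal{S}_{l'}$. The entropy not covered by
the set $\mathcal{S}_{l'}$ is

$$M - l' r \frac{M}{k} = M - \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right) r \frac{M}{k} = M - \left\lceil \frac{k}{r} \right\rceil r \frac{M}{k} + r \frac{M}{k}.$$

To cover that we need an additional number of blocks

$$s \ge \frac{M - l' r \frac{M}{k}}{\frac{M}{k}} = k - l' r = k - \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right) r.$$

Hence, our final set $\mathcal{S}_l$ has size

$$|\mathcal{S}_l| + s - 1 = l(r+1) + s - 1 \ge l'(r+1) + k - \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right) r - 1$$
$$= \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right)(r+1) + k - r \left( \left\lceil \frac{k}{r} \right\rceil - 1 \right) - 1 = \left\lceil \frac{k}{r} \right\rceil + k - 2.$$

Again, due to the fact that the distance is bounded by
$n - |\mathcal{S}|$ we have $d \le n - \left\lceil \frac{k}{r} \right\rceil - k + 2$. □

From the above proof we obtain the following corollary.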

Corollary 2. In terms of the code distance, non- 
overlapping (r + l)-groups are optimal. 

In [10], it was proven that (k, n-k) linear codes have
minimum code distance that is bounded as
$d \le n - k - \left\lceil \frac{k}{r} \right\rceil + 2$. As we see from our distance-locality bound, the
limit of linear codes is information theoretic optimal,
i.e., linear codes suffice to achieve it. Indeed, in the
following we show that the distance bound is tight and
we present randomized and explicit codes that achieve
it.$^{4}$

C. ACHIEVABILITY OF THE BOUND 

In this section, we show that the bound of Theorem
2 is achievable using a random linear network coding
(RLNC) approach as the one presented in [16]. Our proof
uses a variant of the information flow graph that was
introduced in [6]. We show that a distance d is feasible
if a cut-set bound on this new flow graph is sufficiently
large for multicast sessions to run on it.

In the same manner as [6] , the information flow graph 
represents a network where the k input blocks are de- 
picted as sources, the n coded blocks are represented 
as intermediate nodes of the network, and the sinks of 
the network are nodes that need to decode the k file 
blocks. The innovation of the new flow graph is that it
is "locality aware" by incorporating an appropriate dependency
subgraph that accounts for the existence of
repair groups of size (r + 1). The specifications of this
network, i.e., the number and degree of blocks, the edge
capacities, and the cut-set bound are all determined by
the code parameters k, n-k, r, d. For coding parameters
that do not violate the distance bound in Theorem
2, the minimum s-t cut of such a flow graph is at
least M. The multicast capacity of the induced network
is achievable using random linear network codes.
This achievability scheme corresponds to a scalar linear
code with parameters k, n-k, r, d.

4 In our following achievability proof of the above information
theoretic bound we assume that (r+1)|n and we
consider non-overlapping repair groups. This means that
$\Gamma(i) \equiv \Gamma(j)$ for all $i, j \in \Gamma(i)$.




Figure 9: The $\mathcal{G}(k, n-k, r, d)$ information flow
graph. Legend: $X_i$: i-th file block (source); (r+1)-group
flow bottleneck; coded block; $\mathrm{DC}_i$: i-th Data Collector (sink).



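In Fig. 9, we show the general structure of an information
flow graph. We refer to this directed graph as
$\mathcal{G}(k, n-k, r, d)$ with vertex set

$$\mathcal{V} = \Big\{ \{X_i;\ i \in [k]\},\ \{\Gamma_j^{\mathrm{in}}, \Gamma_j^{\mathrm{out}};\ j \in [n]\},\ \{Y_j^{\mathrm{in}}, Y_j^{\mathrm{out}};\ j \in [n]\},\ \{\mathrm{DC}_l;\ \forall l \in [T]\} \Big\}.$$

The directed edge set is implied by the following edge
capacity function

$$c_e(v,u) = \begin{cases}
\infty, & (v,u) \in \left( \{X_i;\ i \in [k]\}, \{\Gamma_j^{\mathrm{in}};\ j \in [\tfrac{n}{r+1}]\} \right) \cup \left( \{\Gamma_j^{\mathrm{out}};\ j \in [\tfrac{n}{r+1}]\}, \{Y_j^{\mathrm{in}};\ j \in [n]\} \right) \cup \left( \{Y_j^{\mathrm{out}};\ j \in [n]\}, \{\mathrm{DC}_l;\ l \in [T]\} \right), \\
\frac{M}{k}, & (v,u) \in \left( \{Y_j^{\mathrm{in}};\ j \in [n]\}, \{Y_j^{\mathrm{out}};\ j \in [n]\} \right), \\
0, & \text{otherwise.}
\end{cases}$$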



The vertices $\{X_i;\ i \in [k]\}$ correspond to the k file blocks
and $\{Y_j^{\mathrm{out}};\ j \in [n]\}$ correspond to the coded blocks. The
edge capacity between the in- and out- vertices corresponds
to the entropy of a single coded block. When
r + 1 blocks are elements of a group, then their "joint
flow," or entropy, cannot exceed $r\frac{M}{k}$. To enforce this
entropy constraint, we bottleneck the in-flow of each
group by a node that restricts it to be at most $r\frac{M}{k}$. For
a group $\Gamma(i)$, we add node $\Gamma_i^{\mathrm{in}}$ that receives flow by the
sources and is connected with an edge of capacity $r\frac{M}{k}$
to a new node $\Gamma_i^{\mathrm{out}}$. The latter connects to the r + 1
blocks of the i-th group. The file blocks travel along
the edges of this graph towards the sinks, which we call
Data Collectors (DCs). A DC needs to connect to
enough coded blocks to reconstruct the
file. This is equivalent to requiring s-t cuts between
the file blocks and the DCs that are at least equal to
M, i.e., the file size. We should note that when we are
considering a specific group, we know that any block
within that group can be repaired from the remaining r
blocks. When a block is lost, the functional dependence
among the blocks in an (r+1)-group allows a newcomer
block to compute a function on the remaining r blocks
and reconstruct what was lost.

Observe that if the distance of the code is d, then
there are $T = \binom{n}{n-d+1}$ DCs, each with in-degree n - d + 1,
whose incident vertices originate from n - d + 1
blocks. The cut-set bound of this network is defined
by the set of minimum cuts between the file blocks and
each of the DCs. A source-DC cut in $\mathcal{G}(k, n-k, r, d)$
determines the amount of flow that travels from the file
blocks to the DCs. The following lemma
states that if d is consistent with the bound of Theorem
2, then the minimum of all the s-t cuts is at least as much
as the file size M.
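The number of Data Collectors grows quickly with the code length. The sketch below simply evaluates the binomial coefficient above; the helper name is hypothetical, the first example uses the fact that a (10,4) RS code is MDS (so d = n - k + 1 = 5), and the second is a toy parameter set chosen for illustration.

```python
import math


def num_data_collectors(n, d):
    """Number of DCs: one per choice of n - d + 1 coded blocks out of n."""
    if not (1 <= d <= n):
        raise ValueError("expected 1 <= d <= n")
    return math.comb(n, n - d + 1)


rs_dcs = num_data_collectors(n=14, d=5)   # (10,4) RS code, MDS with d = 5
toy_dcs = num_data_collectors(n=6, d=2)   # hypothetical toy code
```

Since every DC contributes n - d + 1 edges, this count dominates the size of the flow graph for realistic parameters.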

Lemma 2. The minimum source-DC cut in $\mathcal{G}(k, n-k, r, d)$
is at least M, when $d \le n - \left\lceil \frac{k}{r} \right\rceil - k + 2$.

Proof: Omitted due to lack of space. □

Lemma 2 verifies that for given n, k, r, and a valid distance
d according to Theorem 2, the information flow
graph is consistent with the bound: the DCs have enough
entropy to decode all file blocks, when the minimum cut
is at least M. The above results imply that the flow
graph $\mathcal{G}(k, n-k, r, d)$ captures both the block locality
and the DC requirements. Then, a successful multicast
session on $\mathcal{G}(k, n-k, r, d)$ is equivalent to all DCs
decoding the file.
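The statement of Lemma 2 can be sanity-checked on toy parameters. The sketch below is an illustration, not the omitted proof: it assumes non-overlapping (r+1)-groups with (r+1) dividing n, and it uses the observation that, in the graph described above, the flow reaching a DC through one group is limited both by the group bottleneck (r units of M/k) and by the number of blocks of that group the DC connects to. The function name and parameters are hypothetical.

```python
from itertools import combinations


def min_dc_flow(n, r, d):
    """Smallest flow, in units of M/k, reaching any DC that reads n - d + 1 blocks."""
    if n % (r + 1) != 0:
        raise ValueError("this sketch assumes (r + 1) divides n")
    group_of = [j // (r + 1) for j in range(n)]  # non-overlapping groups
    smallest = None
    for blocks in combinations(range(n), n - d + 1):
        per_group = {}
        for j in blocks:
            per_group[group_of[j]] = per_group.get(group_of[j], 0) + 1
        # Each group delivers at most r units (bottleneck) and at most one
        # unit per connected block.
        flow = sum(min(count, r) for count in per_group.values())
        smallest = flow if smallest is None else min(smallest, flow)
    return smallest


# Toy parameters (hypothetical): n = 6, k = 4, r = 2, so the bound gives d <= 2.
flow_at_bound = min_dc_flow(n=6, r=2, d=2)
flow_beyond_bound = min_dc_flow(n=6, r=2, d=3)
```

With d = 2 every DC receives 4 = k units, enough for the whole file, while d = 3 leaves some DC with only 3 units, in line with the distance bound.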

Theorem 3. If a multicast session onQ(k,n—k,r,d) 
is feasible, then there exist a (k,n—k) codeC of locality 
r and distance d . 

Hence, the random linear network coding (RLNC) 
scheme of Ho et al. [16] achieves the cut-set bound 
of $\mathcal{G}(k, n-k, r, d)$, i.e., there exist capacity achieving 
network codes, which implies that there exist codes that 
achieve the distance bound of Theorem 2. Instead of the 
RLNC scheme, we could use the deterministic construction 
algorithm of Jaggi et al. [18] to construct explicit 
capacity achieving linear codes for multicast networks. 
Using that scheme, we could obtain in time polynomial 
in $T$ explicit $(k, n-k)$ codes of locality $r$. 

Lemma 3. For a network with $E$ edges, $k$ sources, 
and $T$ destinations, where $\eta$ links transmit linear 
combinations of inputs, the probability of success of the RLNC 
scheme is at least $(1 - T/q)^{\eta}$, where $q = |\mathbb{F}|$ is the 
field size. Moreover, using the algorithm in [18], a 
deterministic linear code over $\mathbb{F}$ can be found in time 
$O(ETk(k+T))$. 

The number of edges in our network is 

$$E = \frac{n(k+2r+3)}{r+1} + (n-d+1)\binom{n}{k+\lceil k/r \rceil - 1},$$ 

hence we can calculate the complexity order of the 
deterministic algorithm, which is 

$$ETk(k+T) = O(T^3 k^2) = O\left(k^2\, 8^{\,nH_2\left(\frac{k+\lceil k/r \rceil - 1}{n}\right)}\right),$$ 

where $H_2(\cdot)$ is the binary entropy function. The above and 
Lemma 3 give us the following existence theorem. 

Theorem 4. There exists a linear code over $\mathbb{F}$ with 
locality $r$ and length $n$, such that $(r+1) \mid n$, that has 
distance $d = n - \lceil k/r \rceil - k + 2$, if 

$$|\mathbb{F}| = q > \binom{n}{k+\lceil k/r \rceil - 1} = O\left(2^{\,nH_2\left(\frac{k+\lceil k/r \rceil - 1}{n}\right)}\right).$$ 

Moreover, we can construct explicit codes over $\mathbb{F}$, with 
$|\mathbb{F}| = q$, in time $O\left(k^2\, 8^{\,nH_2\left(\frac{k+\lceil k/r \rceil - 1}{n}\right)}\right)$. 
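
To give a feel for the sizes involved, the sketch below evaluates the edge count, the field-size threshold, and the running-time expression $ETk(k+T)$ for one parameter set. It assumes the edge count $E = n(k+2r+3)/(r+1) + (n-d+1)\binom{n}{k+\lceil k/r \rceil - 1}$ with $d$ meeting the distance bound. The function name and the parameters $(n, k, r) = (18, 10, 5)$ are hypothetical and chosen only for illustration.

```python
from math import comb

def flow_graph_sizes(n: int, k: int, r: int):
    """Return (d, T, E) for the flow graph, assuming (r+1) divides n."""
    if n % (r + 1) != 0:
        raise ValueError("this sketch assumes (r+1) | n")
    ceil_k_over_r = -(-k // r)                # integer ceiling of k/r
    d = n - ceil_k_over_r - k + 2             # distance meeting the bound
    T = comb(n, k + ceil_k_over_r - 1)        # number of DCs, equals C(n, n-d+1)
    group_edges = (n // (r + 1)) * (k + 2 * r + 3)
    dc_edges = (n - d + 1) * T
    return d, T, group_edges + dc_edges

n, k, r = 18, 10, 5                           # hypothetical parameters
d, T, E = flow_graph_sizes(n, k, r)

# Smallest binary extension field GF(2^m) with 2^m > T.
m = T.bit_length()

# Operation count in the O(ETk(k+T)) bound, ignoring constants.
deterministic_cost = E * T * k * (k + T)

print(f"d={d}, T={T}, E={E}, field GF(2^{m}), ETk(k+T)={deterministic_cost:.3e}")
```

The field size and running time are driven almost entirely by $T$, which grows exponentially in $n$; this is the motivation for seeking constructions over small fields.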

Observe that by setting $r = \log(k)$, we obtain Theorem 1. 
Moreover, we would like to note that if for each 
$(r+1)$-group we "deleted" a coded block, then the 
remaining code would be a $(k, n'-k)$-MDS code, where 
$n' = n - \frac{n}{r+1}$, assuming no repair groups overlap. This 
means that LRCs are constructed on top of MDS codes 
by adding $r$-degree parity coded blocks. A general 
construction that operates over small fields and can be 
constructed in time polynomial in the number of DCs 
is an interesting open problem. 

D. AN EXPLICIT LRC USING REED- 
SOLOMON PARITIES 

We design a (10, 6, 5)-LRC based on Reed-Solomon 
codes and Interference Alignment. We use as a basis for 
that a (10, 4)-RS code defined over a binary extension 
field $\mathbb{F}_{2^m}$. We concentrate on these specific instances of 
RS codes since these are the ones that are implemented 
in practice and in particular in the HDFS RAID 
component of Hadoop. We continue by introducing a general 
framework for the design of $(k, n-k)$ Reed-Solomon 
codes. 

The $(n-k) \times n$ (Vandermonde type) parity-check matrix 
of a $(k, n-k)$-RS code defined over an extended binary 
field $\mathbb{F}_{2^m}$, of order $q = 2^m$, is given by 
$[H]_{i,j} = a_{j-1}^{i-1}$, where $a_0, a_1, \ldots, a_{n-1}$ are $n$ 
distinct elements of the field $\mathbb{F}_{2^m}$. The order of the 
field has to be $q \ge n$. We can select $\alpha$ to be a generator 
element of the cyclic multiplicative group defined over 
$\mathbb{F}_{2^m}$. Hence, let $\alpha$ be a primitive element of the field 
$\mathbb{F}_{2^m}$. Then, $[H]_{i,j} = \alpha^{(i-1)(j-1)}$, for $i \in [n-k]$, $j \in [n]$. 
The above parity check matrix defines a $(k, n-k)$-RS 
code. It is a well-known fact that, due to its determinant 
structure, any $(n-k) \times (n-k)$ sub-matrix of $H$ 
has a nonzero determinant and hence is full-rank. This, 
in turn, means that a $(k, n-k)$-RS code defined using the 
parity check matrix $H$ is an MDS code, i.e., has optimal 
minimum distance $d = n-k+1$. We refer to the $k \times n$ 
generator matrix of this code as $G$. 
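
The full-rank property can be checked exhaustively for a small instance. The sketch below builds $[H]_{i,j} = \alpha^{(i-1)(j-1)}$ for $n = 14$, $k = 10$ over $\mathbb{F}_{2^4}$ and tests every $4 \times 4$ column sub-matrix by Gaussian elimination. The choice of field $\mathbb{F}_{16}$, the primitive polynomial $x^4 + x + 1$, and all function names are assumptions made for this example, not details taken from the HDFS RAID implementation.

```python
from itertools import combinations

# GF(16) log/antilog tables; primitive polynomial x^4 + x + 1 is an assumption.
EXP, LOG = [0] * 30, [0] * 16
x = 1
for e in range(15):
    EXP[e] = EXP[e + 15] = x
    LOG[x] = e
    x <<= 1
    if x & 0x10:
        x ^= 0x13

def mul(a: int, b: int) -> int:
    """Multiply two GF(16) elements."""
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def inv(a: int) -> int:
    """Multiplicative inverse of a nonzero GF(16) element."""
    return EXP[(15 - LOG[a]) % 15]

N, K = 14, 10
# 0-indexed version of [H]_{i,j} = alpha^{(i-1)(j-1)}, with n-k = 4 rows.
H = [[EXP[(i * j) % 15] for j in range(N)] for i in range(N - K)]

def is_nonsingular(mat) -> bool:
    """Row-reduce a square matrix over GF(16); True if it has full rank."""
    m = [row[:] for row in mat]
    size = len(m)
    for c in range(size):
        piv = next((r for r in range(c, size) if m[r][c]), None)
        if piv is None:
            return False
        m[c], m[piv] = m[piv], m[c]
        scale = inv(m[c][c])
        m[c] = [mul(scale, v) for v in m[c]]
        for r in range(c + 1, size):
            f = m[r][c]
            if f:
                # Subtraction equals XOR in characteristic 2.
                m[r] = [v ^ mul(f, w) for v, w in zip(m[r], m[c])]
    return True

all_full_rank = all(
    is_nonsingular([[H[i][j] for j in cols] for i in range(N - K)])
    for cols in combinations(range(N), N - K)
)
print("every 4x4 column sub-matrix of H is full rank:", all_full_rank)
```

Note that the first row of this $H$ is the all-ones vector, a fact used later in the locality proof.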

Based on a (14, 10)-RS generator matrix, we will 
introduce 2 simple parities on the first 5 and second 5 
coded blocks of the RS code. This will yield the 
generator matrix of our LRC 

$$G_{\mathrm{LRC}} = \left[\; G \;\Big|\; \sum_{i=1}^{5} g_i \;\Big|\; \sum_{i=6}^{10} g_i \;\right] \qquad (7)$$ 


where $g_i$ denotes the $i$-th column of $G$, for $i \in [14]$. 
We would like to note that even if $G_{\mathrm{LRC}}$ is not in 
systematic form, i.e., the first 10 blocks are not the 
initial file blocks, we can easily convert it into one. To 
do so we need to apply a full-rank transformation on 
the rows of $G_{\mathrm{LRC}}$ in the following way: 
$A\,G_{\mathrm{LRC}} = A\,[\,G_{:,1:10} \;\; [G_{\mathrm{LRC}}]_{:,11:16}\,] = [\,I_{10} \;\; A\,[G_{\mathrm{LRC}}]_{:,11:16}\,]$, 
where $A = G_{:,1:10}^{-1}$ and $M_{:,i:j}$ denotes the submatrix of a 
matrix $M$ that consists of the columns with indices from 
$i$ to $j$. This transformation renders our code systematic, 
while retaining its distance and locality properties. 
We proceed to the main result of this section. 

Theorem 5. The code $\mathcal{C}$ of length 16 defined by $G_{\mathrm{LRC}}$ 
has locality 5 for all coded blocks and optimal distance 
$d = 5$. 

Proof: We first prove that all coded blocks of $G_{\mathrm{LRC}}$ 
have locality 5. Instead of considering block locality, we 
can equivalently consider the locality of the columns of 
$G_{\mathrm{LRC}}$, without loss of generality. First let $i \in [5]$. Then, 
$g_i$ can be reconstructed from the XOR parity $\sum_{j=1}^{5} g_j$ 
if the 4 other columns $g_j$, $j \in \{1, \ldots, 5\} \setminus \{i\}$, are 
subtracted from it. The same goes for $i \in \{6, \ldots, 10\}$, 
i.e., $g_i$ can be reconstructed by subtracting $g_j$, for 
$j \in \{6, \ldots, 10\} \setminus \{i\}$, from the XOR parity $\sum_{j=6}^{10} g_j$. 
However, it is not straightforward how to repair the last 4 
coded blocks, i.e., the parity blocks of the systematic code 
representation. At this point we make use of Interference 
Alignment. Specifically, we observe the following: since 
the all-ones vector of length $n$ is in the span of the rows 
of the parity check matrix $H$, it has to be orthogonal 
to the generator matrix $G$, i.e., $G\mathbf{1}^T = \mathbf{0}_{k \times 1}$, due 
to the fundamental property $GH^T = \mathbf{0}_{k \times (n-k)}$. This 
means that $G\mathbf{1}^T = \mathbf{0}_{k \times 1} \Leftrightarrow \sum_{i=1}^{14} g_i = \mathbf{0}_{k \times 1}$, and any 
column of $G_{\mathrm{LRC}}$ between the 11-th and 14-th is also 
a function of 5 other columns. For example, for $Y_{11}$ 
observe that we have 

$$g_{11} = \left(\sum_{i=1}^{5} g_i\right) + \left(\sum_{i=6}^{10} g_i\right) + g_{12} + g_{13} + g_{14},$$ 

where $\left(\sum_{i=1}^{5} g_i\right)$ is the first XOR parity 
and $\left(\sum_{i=6}^{10} g_i\right)$ is the second, and "$-$"s become "$+$"s due 
to the binary extended field. In the same manner as 
$g_{11}$, all other columns can be repaired using 5 columns 
of $G_{\mathrm{LRC}}$. Hence all coded blocks have locality 5. 
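
The alignment argument can be checked numerically on a single codeword. The sketch below encodes 10 symbols with a (14, 10)-RS code over $\mathbb{F}_{16}$ whose parity-check matrix has an all-ones first row, appends the two XOR parities, and repairs one data symbol and one RS parity symbol from 5 other symbols each. The field size, the primitive polynomial $x^4 + x + 1$, the sample data vector, and all function names are assumptions for illustration; a deployed system would operate on whole blocks rather than on single 4-bit symbols.

```python
# GF(16) log/antilog tables; primitive polynomial x^4 + x + 1 is an assumption.
EXP, LOG = [0] * 30, [0] * 16
x = 1
for e in range(15):
    EXP[e] = EXP[e + 15] = x
    LOG[x] = e
    x <<= 1
    if x & 0x10:
        x ^= 0x13

def mul(a, b):
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]

def inv(a):
    return EXP[(15 - LOG[a]) % 15]

N, K = 14, 10
M = N - K
# Vandermonde parity-check matrix; row 0 is the all-ones vector.
H = [[EXP[(i * j) % 15] for j in range(N)] for i in range(M)]

def encode(data):
    """Systematic RS encoding: find 4 parities p with H * [data, p]^T = 0."""
    rows = []
    for i in range(M):
        rhs = 0
        for j in range(K):
            rhs ^= mul(H[i][j], data[j])
        rows.append(H[i][K:] + [rhs])         # augmented row [H_parity | rhs]
    for c in range(M):                        # Gauss-Jordan elimination over GF(16)
        piv = next(r for r in range(c, M) if rows[r][c])
        rows[c], rows[piv] = rows[piv], rows[c]
        scale = inv(rows[c][c])
        rows[c] = [mul(scale, v) for v in rows[c]]
        for r in range(M):
            f = rows[r][c]
            if r != c and f:
                rows[r] = [v ^ mul(f, w) for v, w in zip(rows[r], rows[c])]
    return list(data) + [rows[i][M] for i in range(M)]

def xor_all(symbols):
    out = 0
    for s in symbols:
        out ^= s
    return out

data = [3, 7, 1, 0, 12, 5, 9, 15, 2, 8]       # arbitrary sample symbols
c = encode(data)                              # 14 RS-coded symbols
s1, s2 = xor_all(c[:5]), xor_all(c[5:10])     # the two local XOR parities
lrc = c + [s1, s2]                            # 16 coded symbols in total

# Repair the 3rd data symbol from S1 and the 4 other symbols of its group.
repaired_data = s1 ^ c[0] ^ c[1] ^ c[3] ^ c[4]
# Repair the 11th symbol (first RS parity) from S1, S2 and the other 3 RS parities.
repaired_parity = s1 ^ s2 ^ c[11] ^ c[12] ^ c[13]

assert xor_all(c) == 0                        # the 14 RS symbols sum to zero
assert repaired_data == c[2]
assert repaired_parity == c[10]
```

Both repairs read exactly 5 symbols, matching the locality claimed for the data blocks and for the RS parity blocks.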

It should be clear that the distance of our code is at 
least equal to that of its (14, 10)-RS precode, that is, $d \ge 5$. 
We prove that $d = 5$ is the maximum distance possible 
for a length 16 code that has block locality 5. Consider 
any code of locality $r = 5$ and length $n = 16$ for $M = 10$. 
Then, there exist 6-groups associated with the $n$ coded 
blocks of the code. Let $Y_{\Gamma(i)}$ be the set of 6 coded blocks 
in the repair group of $i \in [16]$. Then, $H(Y_{\Gamma(i)}) \le 5$, for 
all $i \in [16]$. Moreover, observe that due to the fact that 
$6 \nmid 16$ there have to exist at least two distinct overlapping 
groups $Y_{\Gamma(i_1)}$ and $Y_{\Gamma(i_2)}$, $i_1, i_2 \in [16]$, such that 
$|Y_{\Gamma(i_1)} \cap Y_{\Gamma(i_2)}| \ge 1$. Hence, although the cardinality 
of $|Y_{\Gamma(i_1)} \cup Y_{\Gamma(i_2)}|$ is 11, its joint entropy is bounded as 

$$H(Y_{\Gamma(i_1)}, Y_{\Gamma(i_2)}) = H(Y_{\Gamma(i_1)}) + H(Y_{\Gamma(i_2)} \mid Y_{\Gamma(i_1)}) < 10,$$ 

i.e., at least one additional coded block has to be 
included to reach an aggregate entropy of $M = 10$. 
Therefore, any code of length $n = 16$ and locality 5 can have 
distance at most 5, i.e., $d = 5$ is optimal for the given 
locality. □ 